CPSC 545/445 (Autumn 2003) - Class 19:  Gene Expression Analysis & Regulatory Network Inference
Module 6, Part 3

----

brief recap:

- clustering methods for gene expression data
  -> many general clustering + classification methods,
	in many cases based on local search

  - clustering is (to some extent) an art
  - no universal, agreed-upon criteria for eval of solutions
  - no ultimate alg (problem too general; but true even for gene expression)
  - further improvement of alg needs more data (real + synthetic)
  [from conclusions of Shamir and Sharan]


---
6.3 cancer diagnosis: algorithms for class prediction and class discovery

Golub et al. Molecular Classification of Cancer: Class Discovery
and Class Prediction by Gene Expression Monitoring. Science 286, 531-537,
1999.

[reading assignment, not covered in class]

---
6.4 Inference of Regulatory Relationships from Gene Expression Data

background:

- many known regulators control transcription of mRNA
- identification of regulator/regulatee pairs forms basis 
	for inference of regulatory networks

- here: detect pairs based on microarray time series data
	= gene expression level profiles obtained from multiple 
	microarray experiments

[slide]

types of regulation between two genes X,Y:

  activation vs. inhibition:
	activation: X+ -> Y+
	inhibition: X+ -> Y-

  direction/causation: X regulates Y != Y regulates X
	intuitively: X regulates Y 
		-> features in X preceeds corresponding feaures in Y 
		(cause and effect)

  direct vs. indirect:
	X -> Y (no intermediaries)
	X -> Z -> Y
	X -> Z_1 -> .... Z_k -> Y

  unconditional vs. conditional:
	X -> Y (always)
	X -> Y in the presence/absence of Z 
	X -> Y in the presence of Z_1, ..., Z_k and absence of W_1, ..., W_m


three approaches:

- correlation analysis: use Pearson correlation coefficient
	r(X,Y) = ss_xy / sqrt(ss_xx*ss_yy), where
		ss_xx = sum_i[(x_i-avg(x))^2]
		ss_yy = sum_i[(y_i-avg(y))^2]
		s_xy = sum_i[(x_i-avg(x))(y_i-avg(y))]
	as indicator of similarity of expression profiles X,Y
  correlation coefficient measures strength of linear dependence between
	X,Y; -1 or 1 = perfect association; 0 = no association
  [activation -> r(X,Y) = ? same for inhibition?]
  
  widely used for analysis of gene expression profiles

  problems/weaknesses:
  - good for detecting global similarity, but not local similarity
	(which may be due to conditional regulatory relationships)
  - doesn't take into account temporal aspects
	-> only non-directional predictions
	doesn't account for consistent delays (e.g., due to indirect regulation)

- edge detection method by Filkov et al. (2001):
  basic idea: improve detection of local similarity by focussing
	on major changes in gene expression level (edges)
  high level algorithm outline:
  - determine major edges in each profile X,Y
  - remove spurious edges
  - compute score from pairs of edges in X,Y that have the same direction 
	(rising/falling) and are temporally close

  note: designed to deal with noise, time delays, local similarity;
	gives directional predictions

  [slide]

- event method by Kwon et al. (2003):
  basic idea: study relationship between major features in profiles

  high level algorithm outline:
  - abstract profiles into strings of (time stamped) events
	use smoothing and filtering to remove noise
  - align event strings (using variant of Needleman-Wunsch alg for global seq alignment)
  	to obtain score

  scoring matrix: 
	R	C	F
  R	S(dT)	0	-beta*S(dT)
  C 	0	0	0
  F	-beta*S(dT) 0 	alpha*S(dT)

  dT = time difference between aligned events

  note: designed to deal with noise, time delays, local similarity,
	known assymmetries between increases and decreases in expression level
	gives directional predictions
	

results:

- biological data:
  - yeast cell cycle data. Spellman et al., 1998
	contains profiles for all ORFs in yeast (> 6000)
	(problem: very limited temporal resolution - 16 time points;
		inaccuracies due to problems in cell cycle synchronisation)
	reduced to 458 genes involved in 888 known regulations
	(note: > 200 000 directional candidate interactions)
  - compare rankings of candidate pairs based on correlation, edge, event scores
  - result: 
	- number of true positives in top-k predictions
	  comparable between all three methods
	- very little overlap between true positive predictions of different methods
	  -> methods complement each other!

- artificial data:
  - allows to abstract from weaknesses in real data, specifically test
	dependence of performance on individual factors
	(e.g., amount of delay, degree of similarity, etc.)
  -> event significantly superior to correlation in most cases
	(especially for data with large delays, only local similarity)
  (edge couldn't be tested because of many 0 scores - no major edges detected)


[slides: andrew's thesis, 13,14,33,39,40,45,47,50]

--
piece together pairwise regulations to obtain full network
(many complications in the details, work in progress in BETA Lab @ UBC/CS)


--
different approach: Cheng et al. (2000)

identify entire network by
- building graph of good candidate pairs (from pairwise analysis)
- find subset of edges corresponding to true regulations 
	using combinatorial optimisation methods 
	and optimisation objective that models assumptions about
	general (local) structure of network
	(e.g., all regulatees have at least one activator and one inhibitor)
  (hard problem -> use SLS methods, here simulated annealing, to solve)

	
--
similarly, Baysian network learning methods have been applied
to the genetic regulatory network inference problem (Friedman et al., 2000)


---
Resources:

* Golub et al. Molecular Classfication of Cancer: Class Discovery and Class Prediction
by Gene Expression Monitoring. Science 286, 531-537, 1999.

* Andrew T. Kwon, Holger H. Hoos, and Raymond Ng:
Inference of transcriptional regulation relationships from gene expression data.
Bioinformatics 19:905-912, 2003. 


Further Reading:

Ben-Dor et al. Tissue Classification with Gene Expression Profiles, Recomb 2000.

M.J. van der Laan and J. Bryan. Gene Expression Analysis with the Parameter
Bootrap, Biostatistics, 2001

Kwon,A. (2002) Inference of transcriptional regulation relationships
with gene expression data, Master’s thesis, Department of Computer Science, UBC, 2002.


---